Convolutional Neural Networks

Project: Write an Algorithm for Landmark Classification


In this notebook, some template code has already been provided for you, and you will need to implement additional functionality to successfully complete this project. You will not need to modify the included code beyond what is requested. Sections that begin with '(IMPLEMENTATION)' in the header indicate that the following block of code will require additional functionality which you must provide. Instructions will be provided for each section, and the specifics of the implementation are marked in the code block with a 'TODO' statement. Please be sure to read the instructions carefully!

Note: Once you have completed all the code implementations, you need to finalize your work by exporting the Jupyter Notebook as an HTML document. Before exporting the notebook to HTML, all the code cells need to have been run so that reviewers can see the final implementation and output. You can then export the notebook by using the menu above and navigating to File -> Download as -> HTML (.html). Include the finished document along with this notebook as your submission.

In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a 'Question X' header. Carefully read each question and provide thorough answers in the following text boxes that begin with 'Answer:'. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.

Note: Code and Markdown cells can be executed using the Shift + Enter keyboard shortcut. Markdown cells can be edited by double-clicking the cell to enter edit mode.

The rubric contains optional "Stand Out Suggestions" for enhancing the project beyond the minimum requirements. If you decide to pursue the "Stand Out Suggestions", you should include the code in this Jupyter notebook.


Why We're Here

Photo sharing and photo storage services like to have location data for each photo that is uploaded. With the location data, these services can build advanced features, such as automatic suggestion of relevant tags or automatic photo organization, which help provide a compelling user experience. Although a photo's location can often be obtained by looking at the photo's metadata, many photos uploaded to these services will not have location metadata available. This can happen when, for example, the camera capturing the picture does not have GPS or if a photo's metadata is scrubbed due to privacy concerns.

If no location metadata for an image is available, one way to infer the location is to detect and classify a discernible landmark in the image. Given the large number of landmarks across the world and the immense volume of images that are uploaded to photo sharing services, using human judgement to classify these landmarks would not be feasible.

In this notebook, you will take the first steps towards addressing this problem by building models to automatically predict the location of the image based on any landmarks depicted in the image. At the end of this project, your code will accept any user-supplied image as input and suggest the top k most relevant landmarks from 50 possible landmarks from across the world. The image below displays a potential sample output of your finished project.

Sample landmark classification output

The Road Ahead

We break the notebook into separate steps. Feel free to use the links below to navigate the notebook.


Step 0: Download Datasets and Install Python Modules

Note: if you are using the Udacity workspace, YOU CAN SKIP THIS STEP. The dataset can be found in the /data folder and all required Python modules have been installed in the workspace.

Download the landmark dataset. Unzip the folder and place it in this project's home directory, at the location /landmark_images.

Install the following Python modules:


Step 1: Create a CNN to Classify Landmarks (from Scratch)

In this step, you will create a CNN that classifies landmarks. You must create your CNN from scratch (so, you can't use transfer learning yet!), and you must attain a test accuracy of at least 20%.

Although 20% may seem low at first glance, it becomes more reasonable once you realize how difficult this problem is. Often, a photo taken at a landmark captures a fairly mundane scene of an animal or plant, like in the following picture.

Bird in Haleakalā National Park

Just by looking at that image alone, would you have been able to guess that it was taken at the Haleakalā National Park in Hawaii?

An accuracy of 20% is significantly better than random guessing, which would provide an accuracy of just 2%. In Step 2 of this notebook, you will have the opportunity to greatly improve accuracy by using transfer learning to create a CNN.

Remember that practice is far ahead of theory in deep learning. Experiment with many different architectures, and trust your intuition. And, of course, have fun!

(IMPLEMENTATION) Specify Data Loaders for the Landmark Dataset

Use the code cell below to create three separate data loaders: one for training data, one for validation data, and one for test data. Randomly split the images located at landmark_images/train to create the train and validation data loaders, and use the images located at landmark_images/test to create the test data loader.

All three of your data loaders should be accessible via a dictionary named loaders_scratch. Your train data loader should be at loaders_scratch['train'], your validation data loader should be at loaders_scratch['valid'], and your test data loader should be at loaders_scratch['test'].

You may find this documentation on custom datasets to be a useful resource. If you are interested in augmenting your training and/or validation data, check out the wide variety of transforms!

Question 1: Describe your chosen procedure for preprocessing the data.

Answer:

  1. Training dataset
    • I first resize the image to 300 pixels along its smallest dimension. This reduces the size, gives the following transforms a consistent input to work from, and makes them faster.
    • Images are then flipped horizontally with a 50% chance to vary the network input, since there is not a huge amount of data.
    • Next I apply a random rotation (±10 degrees) and scaling (±10%). Images are scaled down by at most a factor of 0.9, so they will never be smaller than the following random crop.
    • The crop sizes are chosen such that even at the maximum rotation of 10 degrees there are no black borders around the image. This should give the network more natural-looking rotated images to learn from.
    • As the last step, the image is resized to a 256 px square and a random 224 px crop is taken as the network input.
  2. Validation and test datasets
    • Images are resized to 256 pixels and then a 224 px center crop is taken.
    • The only augmentation is a random horizontal flip, which gives me a better sense of network generalization and effectively increases the number of available test images.
  3. The final crop size of 224 pixels is chosen so that after 5 max-pool layers, each halving the image, we arrive at a 7x7 feature map, which still has a center point.
  4. To split the training set into training and validation sets, I create two datasets over the same images, shuffle the indices randomly, and take 5% of the images as the validation set.
  5. The data loaders use memory pinning and 3 worker processes to make data loading more efficient and achieve higher GPU utilization. If data cannot reach the GPU fast enough, training speed drops dramatically.

(IMPLEMENTATION) Visualize a Batch of Training Data

Use the code cell below to retrieve a batch of images from your train data loader, display at least 5 images simultaneously, and label each displayed image with its class name (e.g., "Golden Gate Bridge").

Visualizing the output of your data loader is a great way to ensure that your data loading and preprocessing are working as expected.
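A minimal sketch of such a visualization, assuming the batch was normalized with the ImageNet statistics; the helper takes the images and labels directly so it works with any loader.

```python
import matplotlib
matplotlib.use('Agg')  # safe to drop this line when running inside Jupyter
import matplotlib.pyplot as plt
import torch

def show_batch(images, labels, class_names, n=5):
    """Display the first n images of a batch, titled with their class names."""
    mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
    fig, axes = plt.subplots(1, n, figsize=(3 * n, 3))
    for ax, img, lab in zip(axes, images, labels):
        img = (img * std + mean).clamp(0, 1)           # undo normalization
        ax.imshow(img.permute(1, 2, 0).numpy())        # CHW -> HWC for imshow
        ax.set_title(class_names[lab])
        ax.axis('off')
    return fig

# Typical usage with a data loader:
# images, labels = next(iter(loaders_scratch['train']))
# show_batch(images, labels, class_names)
```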

Initialize use_cuda variable
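This cell typically amounts to a one-line check for a CUDA-capable GPU:

```python
import torch

# Use the GPU for training and inference whenever one is available
use_cuda = torch.cuda.is_available()
```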

(IMPLEMENTATION) Specify Loss Function and Optimizer

Use the next code cell to specify a loss function and optimizer. Save the chosen loss function as criterion_scratch, and fill in the function get_optimizer_scratch below.
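A sketch of one reasonable choice, assuming the model outputs raw logits (with an explicit log-softmax output layer, `nn.NLLLoss` would be the matching criterion instead). SGD with momentum and the learning rate shown here are illustrative assumptions, not requirements.

```python
import torch.nn as nn
import torch.optim as optim

# Cross-entropy is the standard loss for multi-class classification
criterion_scratch = nn.CrossEntropyLoss()

def get_optimizer_scratch(model):
    # Hypothetical hyperparameters; Adam would be a reasonable alternative
    return optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```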

(IMPLEMENTATION) Model Architecture

Create a CNN to classify images of landmarks. Use the template in the code cell below.

Question 2: Outline the steps you took to get to your final CNN architecture and your reasoning at each step.

Answer: I used the VGG family as inspiration, implementing something close to the smallest variant, VGG11. The GPU I use for training only has 4 GB of RAM, which limited the network size and number of layers.

The 224x224 input needs to be reduced spatially while the number of filters increases. Five max-pool layers, each halving the resolution, take 224x224 down to 7x7, so I chose that number.

Within each block I am sticking to a repeatable architecture of a Conv2d layer with 3x3 kernel and 1px padding. Only the last Conv2d layer has a batch norm applied before the ReLU activation to improve training stability.

The number of filters doubles after each block with 256 filters after the last layer.

The final classification is done by two linear layers, learning the mapping between the 7x7x256 convolution output and the 50 target classes. Batch normalization after the first linear layer is used to improve the stability and learning rate of the classifier.

A softmax layer as the last step converts the network's scores into a probability distribution over the 50 output classes.
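The description above can be sketched as follows. The starting width of 16 filters and the 1024-unit hidden layer are assumptions (the answer only states that filters double per block, ending at 256); the log-softmax output shown here would pair with `nn.NLLLoss`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    """VGG-style CNN sketch: 5 conv blocks, each followed by a 2x2 max-pool."""

    def __init__(self, num_classes=50):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in (16, 32, 64, 128, 256):      # filters double per block
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.BatchNorm2d(out_ch),     # batch norm before ReLU
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]            # halves spatial size
            in_ch = out_ch
        self.features = nn.Sequential(*layers)     # 224 -> 7 after 5 pools
        self.classifier = nn.Sequential(
            nn.Linear(7 * 7 * 256, 1024),
            nn.BatchNorm1d(1024),                  # stabilize the classifier
            nn.ReLU(inplace=True),
            nn.Linear(1024, num_classes))

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return F.log_softmax(self.classifier(x), dim=1)
```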

(IMPLEMENTATION) Implement the Training Algorithm

Implement your training algorithm in the code cell below. Save the final model parameters at the filepath stored in the variable save_path.
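A minimal sketch of such a loop: train for one epoch, validate, and save the parameters whenever the validation loss improves. The per-sample loss accumulation is one common bookkeeping choice, not a requirement.

```python
import numpy as np
import torch

def train(n_epochs, loaders, model, optimizer, criterion, use_cuda, save_path):
    """Train and validate, saving the best model seen so far to save_path."""
    valid_loss_min = np.inf
    for epoch in range(1, n_epochs + 1):
        model.train()
        train_loss = 0.0
        for data, target in loaders['train']:
            if use_cuda:
                data, target = data.cuda(), target.cuda()
            optimizer.zero_grad()
            loss = criterion(model(data), target)
            loss.backward()
            optimizer.step()
            train_loss += loss.item() * data.size(0)

        model.eval()
        valid_loss = 0.0
        with torch.no_grad():
            for data, target in loaders['valid']:
                if use_cuda:
                    data, target = data.cuda(), target.cuda()
                valid_loss += criterion(model(data), target).item() * data.size(0)

        if valid_loss < valid_loss_min:
            # keep only the parameters of the best-performing epoch
            torch.save(model.state_dict(), save_path)
            valid_loss_min = valid_loss
    return model
```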

(IMPLEMENTATION) Experiment with the Weight Initialization

Use the code cell below to define a custom weight initialization, and then train with your weight initialization for a few epochs. Make sure that neither the training loss nor validation loss is nan.

Later on, you will be able to see how this compares to training with PyTorch's default weight initialization.
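One possible initialization, sketched below: normal-distributed weights scaled by the layer's fan-in (so activations do not blow up and losses stay finite) and zeroed biases. The 1/sqrt(fan_in) scale is one reasonable scheme among several.

```python
import math
import torch.nn as nn

def custom_weight_init(m):
    """Apply via model.apply(custom_weight_init) to every submodule."""
    if isinstance(m, nn.Linear):
        n = m.in_features
        m.weight.data.normal_(0.0, 1.0 / math.sqrt(n))
        if m.bias is not None:
            m.bias.data.zero_()
    elif isinstance(m, nn.Conv2d):
        n = m.in_channels * m.kernel_size[0] * m.kernel_size[1]
        m.weight.data.normal_(0.0, 1.0 / math.sqrt(n))
        if m.bias is not None:
            m.bias.data.zero_()

# model_scratch.apply(custom_weight_init)
```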

(IMPLEMENTATION) Train and Validate the Model

Run the next code cell to train your model.

(IMPLEMENTATION) Test the Model

Run the code cell below to try out your model on the test dataset of landmark images, and to calculate and print the test loss and accuracy. Ensure that your test accuracy is greater than 20%.
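A sketch of the required test routine (named `evaluate` here; the notebook's test cell can call it on `loaders_scratch['test']`): accumulate loss and count correct top-1 predictions over the test loader.

```python
import torch

def evaluate(loader, model, criterion, use_cuda):
    """Return (average loss, accuracy in percent) over a data loader."""
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for data, target in loader:
            if use_cuda:
                data, target = data.cuda(), target.cuda()
            output = model(data)
            total_loss += criterion(output, target).item() * data.size(0)
            correct += (output.argmax(dim=1) == target).sum().item()
            total += data.size(0)
    return total_loss / total, 100.0 * correct / total
```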


Step 2: Create a CNN to Classify Landmarks (using Transfer Learning)

You will now use transfer learning to create a CNN that can identify landmarks from images. Your CNN must attain at least 60% accuracy on the test set.

(IMPLEMENTATION) Specify Data Loaders for the Landmark Dataset

Use the code cell below to create three separate data loaders: one for training data, one for validation data, and one for test data. Randomly split the images located at landmark_images/train to create the train and validation data loaders, and use the images located at landmark_images/test to create the test data loader.

All three of your data loaders should be accessible via a dictionary named loaders_transfer. Your train data loader should be at loaders_transfer['train'], your validation data loader should be at loaders_transfer['valid'], and your test data loader should be at loaders_transfer['test'].

If you like, you are welcome to use the same data loaders from the previous step, when you created a CNN from scratch.

(IMPLEMENTATION) Specify Loss Function and Optimizer

Use the next code cell to specify a loss function and optimizer. Save the chosen loss function as criterion_transfer, and fill in the function get_optimizer_transfer below.
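A sketch under the same assumptions as in Step 1 (logit outputs, hypothetical SGD hyperparameters), with one transfer-learning twist: only parameters that still require gradients are handed to the optimizer, so the frozen backbone is excluded.

```python
import torch.nn as nn
import torch.optim as optim

criterion_transfer = nn.CrossEntropyLoss()

def get_optimizer_transfer(model):
    # Optimize only the trainable parameters (the classifier head);
    # frozen feature-extractor parameters are skipped entirely
    params = [p for p in model.parameters() if p.requires_grad]
    return optim.SGD(params, lr=0.01, momentum=0.9)
```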

(IMPLEMENTATION) Model Architecture

Use transfer learning to create a CNN to classify images of landmarks. Use the code cell below, and save your initialized model as the variable model_transfer.

Question 3: Outline the steps you took to get to your final CNN architecture and your reasoning at each step. Describe why you think the architecture is suitable for the current problem.

Answer:

I chose VGG16 as a feature detector network since it is still reasonably sized and has great performance on image classification tasks.

Gradient calculation for the feature-detector layers is then disabled, since we want to keep the pre-trained features. I replaced the last linear layer of the classifier with one that outputs the correct number of training classes (50) and reinitialized its weights from a normal distribution. Without this reinitialization step, the network fails to learn and achieves very poor accuracy.

I then set the batch size to 24, because this still allows me to train the network on a 4 GB GPU (it runs out of memory with a batch size of 32).

(IMPLEMENTATION) Train and Validate the Model

Train and validate your model in the code cell below. Save the final model parameters at filepath 'model_transfer.pt'.

(IMPLEMENTATION) Test the Model

Try out your model on the test dataset of landmark images. Use the code cell below to calculate and print the test loss and accuracy. Ensure that your test accuracy is greater than 60%.


Step 3: Write Your Landmark Prediction Algorithm

Great job creating your CNN models! Now that you have put in all the hard work of creating accurate classifiers, let's define some functions to make it easy for others to use your classifiers.

(IMPLEMENTATION) Write Your Algorithm, Part 1

Implement the function predict_landmarks, which accepts a file path to an image and an integer k, and then predicts the top k most likely landmarks. You are required to use your transfer learned CNN from Step 2 to predict the landmarks.

An example of the expected behavior of predict_landmarks:

>>> predicted_landmarks = predict_landmarks('example_image.jpg', 3)
>>> print(predicted_landmarks)
['Golden Gate Bridge', 'Brooklyn Bridge', 'Sydney Harbour Bridge']

(IMPLEMENTATION) Write Your Algorithm, Part 2

In the code cell below, implement the function suggest_locations, which accepts a file path to an image as input, and then displays the image and the top 3 most likely landmarks as predicted by predict_landmarks.

Some sample output for suggest_locations is provided below, but feel free to design your own user experience!
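A minimal sketch, assuming a `predict_landmarks`-style function is available (injected here as `predict_fn` so the helper stands alone):

```python
import matplotlib
matplotlib.use('Agg')  # safe to drop this line when running inside Jupyter
import matplotlib.pyplot as plt
from PIL import Image

def suggest_locations(img_path, predict_fn, k=3):
    """Show the image and print the top-k landmark suggestions."""
    names = predict_fn(img_path, k)
    plt.imshow(Image.open(img_path))
    plt.axis('off')
    plt.show()
    if len(names) == 1:
        print(f'Is this photo of the {names[0]}?')
    else:
        print('Is this photo of the')
        print(',\n'.join(names[:-1]) + ', or\n' + names[-1] + '?')
    return names
```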

(IMPLEMENTATION) Test Your Algorithm

Test your algorithm by running the suggest_locations function on at least four images on your computer. Feel free to use any images you like.

Question 4: Is the output better than you expected :) ? Or worse :( ? Provide at least three possible points of improvement for your algorithm.

Answer: (Three possible points for improvement)

The output is pretty good, although there are cases where the top suggestion is not the correct answer.

  1. From just the three answers, it is not clear what the probabilities were. It would be better if the algorithm took the probabilities into consideration: if the top-1 probability is much greater than the rest, it might make sense to return only the top predictions that are close to it; if all probabilities in the top 3 are similar, the network basically has no clue what it is looking at, so perhaps no suggestion should be returned at all.
  2. The algorithm could be made to work with multiple images taken within a short period of time. Most users take several photos of the same landmark, so by looking at more images and comparing predictions we can pick the prediction that scores highest most often.
  3. The network should have an "unknown" category so we can say that an image does not show any of the known landmarks. Alternatively, a second specialized network could first decide whether an image shows a known landmark at all; only in the positive case would the image then be run through the landmark classifier.
  4. Instead of resizing the input image to 224 px, we could resize to a larger size and use multiple crops of the same image to obtain multiple network predictions, hopefully increasing overall accuracy.
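The first improvement (gating suggestions on their probabilities) can be sketched as pure post-processing; the `min_prob` and `dominance` thresholds below are illustrative assumptions that would need tuning.

```python
def filter_confident(predictions, min_prob=0.3, dominance=2.0):
    """Keep only suggestions worth showing the user.

    predictions: list of (class_name, probability), sorted descending.
    Returns [] when no prediction is confident enough (the network is
    essentially guessing), otherwise the winner plus any runner-up whose
    probability is within a factor of `dominance` of the winner's.
    """
    if not predictions or predictions[0][1] < min_prob:
        return []                        # no clear winner: suggest nothing
    top_name, top_p = predictions[0]
    kept = [(top_name, top_p)]
    for name, p in predictions[1:]:
        if top_p / max(p, 1e-9) < dominance:
            kept.append((name, p))       # close enough to the winner to show
    return kept
```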